Near Similarity Search and Plagiarism Analysis
Authors
Abstract
Existing methods for text plagiarism analysis are mainly based on “chunking”, a process that groups a text into meaningful units, each of which is encoded by an integer number. Together these numbers form a document’s signature or fingerprint. An overlap of two documents’ fingerprints indicates a possibly plagiarized text passage. Most approaches use MD5 hashes to construct fingerprints, which entails two problems: (i) it is computationally expensive, and (ii) a small chunk size must be chosen to identify matching passages, which further increases the effort for fingerprint computation, fingerprint comparison, and fingerprint storage. This paper proposes a new class of fingerprints that can be considered an abstraction of the classical vector space model. These fingerprints operationalize the concept of “near similarity” and make it possible to quickly identify candidate passages for plagiarism. Experiments show that a plagiarism analysis based on our fingerprints leads to a speed-up by a factor of five and higher, without compromising the recall performance.

1 Plagiarism Analysis

Plagiarism is the act of claiming to be the author of material that someone else actually wrote (Encyclopædia Britannica, 2005). This definition relates to text documents, which are also the focus of this paper. Clearly, a question of central importance is to what extent such and similar tasks can be automated. Several techniques for plagiarism analysis have been proposed in the past; most of them rely on one of the following ideas.

Substring Matching. Substring matching approaches try to identify maximal matches in pairs of strings, which are then used as plagiarism indicators (Gusfield (1997)). Typically, the substrings are represented in suffix trees, and graph-based measures are employed to capture the fraction of the plagiarized sections (Baker (1993); Monostori et al. (2000, 2002)). In addition, Finkel et al. (2002) as well as Baker (1993) propose the use of text compression algorithms to identify matches. (A minimal matching sketch is given at the end of this section.)

Keyword Similarity. The idea here is to extract and weight topic-identifying keywords from a document and to compare them to the keywords of other documents. If the similarity exceeds a threshold, the candidate documents are divided into smaller pieces, which are then compared recursively (Si et al. (1997); Fullam and Park (2002)). Note that this approach assumes that plagiarism usually happens in topically similar documents.

Fingerprint Analysis. The most popular approach to plagiarism analysis is the detection of overlapping text sequences by means of fingerprinting: documents are partitioned into term sequences, called chunks, from which digital digests are computed that form the document’s fingerprint. When the digests are inserted into a hashtable, collisions indicate matching sequences (see the second sketch below). Recent work that describes details and variants of this approach includes Brin et al. (1995); Shivakumar and Garcia-Molina (1996); Finkel et al. (2002).

1.1 Contributions of this Paper

The overall contribution of this paper relates to the use of fuzzy-fingerprints as an effective tool for plagiarism analysis. To understand the different intentions behind similarity search and plagiarism analysis, we first introduce the distinction between local and global similarity. In fact, fuzzy-fingerprints can be understood as a combination of both paradigms, where the parameter “chunk size” controls the degree of locality. In particular, we use this distinction to develop a taxonomy of methods for plagiarism analysis.
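As a point of reference, the substring-matching idea above can be illustrated with a minimal sketch. It uses Python’s difflib rather than the suffix-tree machinery cited in Section 1, so it finds non-overlapping matching blocks instead of all maximal matches; the length threshold is an illustrative assumption, not a value from the paper.

```python
# Sketch of the substring-matching idea: report long common blocks between
# two texts as plagiarism indicators. difflib stands in for the suffix-tree
# approaches cited above; MIN_MATCH is an illustrative threshold.
from difflib import SequenceMatcher

MIN_MATCH = 40  # minimum match length in characters worth reporting (assumed)

def matching_passages(a: str, b: str, min_match: int = MIN_MATCH):
    """Yield (offset_a, offset_b, matched_text) for long shared substrings."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for m in matcher.get_matching_blocks():
        if m.size >= min_match:
            yield m.a, m.b, a[m.a:m.a + m.size]
```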
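The fingerprinting idea, which the rest of the paper builds on, can be sketched as follows: split each document into chunks, digest every chunk with MD5, and let hashtable collisions reveal shared passages. This is a minimal sketch, not the paper’s implementation; the choice of word 5-grams as chunks and the 64-bit digest truncation are illustrative assumptions.

```python
# Sketch of chunk-based fingerprinting with MD5, as described above.
import hashlib
from collections import defaultdict

def fingerprint(text: str, chunk_size: int = 5) -> set[int]:
    """Encode overlapping word n-grams (chunks) of a document as integers."""
    words = text.lower().split()
    chunks = (" ".join(words[i:i + chunk_size])
              for i in range(max(1, len(words) - chunk_size + 1)))
    # MD5 digest per chunk, truncated to 64 bits here for compactness.
    return {int(hashlib.md5(c.encode()).hexdigest()[:16], 16) for c in chunks}

def collisions(docs: dict[str, str], chunk_size: int = 5) -> dict[int, list[str]]:
    """Insert all digests into a hashtable; digests shared by several
    documents indicate possibly plagiarized passages."""
    table = defaultdict(list)
    for name, text in docs.items():
        for digest in fingerprint(text, chunk_size):
            table[digest].append(name)
    return {d: names for d, names in table.items() if len(names) > 1}
```

A small chunk size makes matching passages easier to find but enlarges the fingerprint, which is exactly the cost trade-off the abstract points out.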
These considerations are presented in the following section. Section 3 reports on experiments that quantify interesting properties of our approach.

2 Fingerprinting, Similarity, and Plagiarism Analysis

In the context of information retrieval, a fingerprint h(d) of a document d can be considered as a set of encoded substrings taken from d, which serve to identify d uniquely. (The term “signature” is sometimes also used in this connection.) Following Hoad and Zobel (2003), the process of creating a fingerprint comprises four areas that need consideration.

1. Substring Selection. From the original document, substrings (chunks) are extracted according to some selection strategy. Such a strategy may consider positional, frequency-based, or structural information.

2. Substring Number. The substring number defines the fingerprint resolution. There is an obvious trade-off between fingerprint quality, processing effort, and storage requirements, which must be carefully balanced: the more information of a document is encoded in the fingerprint, the more reliably a possible collision of two fingerprints can be interpreted.

3. Substring Size. The substring size defines the fingerprint granularity. A fine granularity makes a fingerprint more susceptible to false matches, while with a coarse granularity fingerprinting becomes very sensitive to changes.

4. Substring Encoding. The selected substrings are mapped onto integer numbers. Substring conversion establishes a hash operation where, aside from uniqueness and uniformity, efficiency is also an important issue (Ramakrishna and Zobel (1997)). For this, the popular MD5 hashing algorithm is often employed (Rivest (1992)).

If the main issue is similarity analysis rather than unique identification, the entire document d is used during the substring formation step, i.e., the union of all chunks covers the entire document. The total set of integer numbers represents the fingerprint h(d). Note that the chunks need not be of uniform length but should be formed with the analysis task in mind.

2.1 Local and Global Similarity Analysis

For two documents A and B, let h(A) and h(B) be their fingerprints with the respective resolutions |h(A)| and |h(B)|. Following Finkel et al. (2002), a similarity analysis between A and B that is based on h(A) and h(B) measures the portion of the fingerprint intersection:

    φ_local(A, B) = |h(A) ∩ h(B)|
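The excerpt breaks off immediately after the intersection term, so the normalization is not visible here. The sketch below therefore divides by the smaller fingerprint’s resolution, an assumption suggested by the phrase “portion of the fingerprint intersection”, not the paper’s confirmed definition.

```python
# Sketch of the local similarity measure from Section 2.1. NOTE: the
# normalization by min(|h(A)|, |h(B)|) is an assumption; the excerpt is
# truncated before the paper's actual denominator.
def phi_local(h_a: set[int], h_b: set[int]) -> float:
    """Portion of fingerprint digests shared by documents A and B."""
    if not h_a or not h_b:
        return 0.0
    return len(h_a & h_b) / min(len(h_a), len(h_b))

# Usage with the fingerprint() sketch from Section 1:
#     phi_local(fingerprint(text_a), fingerprint(text_b))
```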
Similar Resources
Fingerprint-based Similarity Search and its Applications
This paper introduces a new technology and tools from the field of text-based information retrieval. The authors have developed (i) a fingerprint-based method for a highly efficient near similarity search, and (ii) an application of this method to identify plagiarized passages in large document collections. The contribution of our work is twofold. Firstly, it is a search technology that enables a ne...
Cross-Language High Similarity Search: Why No Sub-linear Time Bound Can Be Expected
This paper contributes to an important variant of cross-language information retrieval, called cross-language high similarity search. Given a collection D of documents and a query q in a language different from the language of D, the task is to retrieve highly similar documents with respect to q. Use cases for this task include cross-language plagiarism detection and translation search. The cur...
Distributed Similarity and Plagiarism Search
This paper describes the different approaches to plagiarism search and the methods used by the KOPI Online Plagiarism Search and Information Portal, and shows a distributed approach for building a plagiarism search system. This architecture adds scalability to the system by allowing an arbitrary number of identical components to be placed into it. To reduce network traffic and enable secure transfer of t...
Information Retrieval Techniques for Corpus Filtering Applied to External Plagiarism Detection
We present a set of approaches for corpus filtering in the context of external document plagiarism detection. Producing filtered sets, and hence limiting the problem’s search space, can improve performance and is used today in many real-world applications such as web search engines. With regard to document plagiarism detection, the database of documents to match the suspicious candida...
Scalable Document Fingerprinting (Extended Abstract)
As more information becomes available electronically, document search based on textual similarity is becoming increasingly important, not only for locating documents online, but also for addressing internet variants of old problems such as plagiarism and copyright violation. This paper presents an online system that provides reliable search results using modest resources and scales up to data s...